Brain tumour segmentation with incomplete imaging data

Abstract
Progress in neuro-oncology is increasingly recognized to be obstructed by the marked heterogeneity (genetic, pathological, and clinical) of brain tumours. If the treatment susceptibilities and outcomes of individual patients differ widely, determined by the interactions of many multimodal characteristics, then large-scale, fully-inclusive, richly phenotyped data, including imaging, will be needed to predict them at the individual level. Such data can realistically be acquired only in the routine clinical stream, where its quality is inevitably degraded by the constraints of real-world clinical care. Although contemporary machine learning could theoretically provide a solution to this task, especially in the domain of imaging, its ability to cope with realistic, incomplete, low-quality data is yet to be determined. In the largest and most comprehensive study of its kind, applying state-of-the-art brain tumour segmentation models to large-scale, multi-site MRI data of 1251 individuals, here we quantify the comparative fidelity of automated segmentation models drawn from MR data replicating the various levels of completeness observed in real life. We demonstrate that models trained on incomplete data can segment lesions very well, often equivalently to those trained on the full complement of images, exhibiting Dice coefficients of 0.907 (single sequence) to 0.945 (complete set) for whole tumours and 0.701 (single sequence) to 0.891 (complete set) for component tissue types. This finding opens the door both to the application of segmentation models to large-scale historical data, for the purpose of building treatment and outcome predictive models, and their application to real-world clinical care. We further ascertain that segmentation models can accurately detect enhancing tumour in the absence of contrast-enhancing imaging, quantifying the burden of enhancing tumour with an R² > 0.97, varying negligibly with lesion morphology.
Such models can quantify enhancing tumour without the administration of intravenous contrast, inviting a revision of the notion of tumour enhancement if the same information can be extracted without contrast-enhanced imaging. Our analysis includes validation on a heterogeneous, real-world 50 patient sample of brain tumour imaging acquired over the last 15 years at our tertiary centre, demonstrating maintained accuracy even on non-isotropic MRI acquisitions, or even on complex post-operative imaging with tumour recurrence. This work substantially extends the translational opportunity for quantitative analysis to clinical situations where the full complement of sequences is not available and potentially enables the characterization of contrast-enhanced regions where contrast administration is infeasible or undesirable.


Introduction
Progress in neuro-oncology is increasingly recognized to be obstructed by the marked heterogeneity (genetic, pathological, and clinical) of brain tumours. If the treatment susceptibilities and outcomes of individual patients differ widely, determined by the interactions of many multimodal characteristics, 1 then large-scale, fully-inclusive, richly phenotyped data, including imaging, will be needed to predict them at the individual level. Such data can realistically be acquired only in the routine clinical stream, where its quality is inevitably degraded by the constraints of real-world clinical care. Although contemporary machine learning could theoretically provide a solution to this task, especially in the domain of imaging, its ability to cope with realistic, incomplete, low-quality data is yet to be determined.
Over the last few decades, lesion segmentation has formed a cornerstone of innovation across the domains of neuro-oncology, 2-4 medical imaging, 5,6 biomedical engineering, 7 machine learning, and deep learning. 8 The ability to segment an anatomical or pathological lesion in 3D confers the ability to evaluate it quantitatively, moving beyond visual qualitative assessment, with greater richness and fidelity than conventional 2D measurements, which have repeatedly been shown to be often spurious and inconsistent between radiologists, [9][10][11] and with greater sensitivity to the heterogeneity of the underlying pathological patterns. 12 Enabling radiological image segmentation opens a wide array of possibilities for downstream innovation in neuro-oncological healthcare and research, ranging from clinical stratification, outcome prediction, response assessment, and treatment allocation to risk quantification, many of which have already shown great promise. The underlying goal is to enhance the individual fidelity of data-driven decision-making, facilitating better patient-centred care, [13][14][15] a remit especially warranted in neuro-oncology.
The segmentation of brain tumours remains a particularly challenging task owing to the marked heterogeneity of their imaging appearances: spatial distribution, morphology, signal characteristics, and impact on adjacent healthy anatomical structures. [16][17][18] Its difficulty has even inspired an international competition for cutting-edge deep learning groups to create the best segmentation model. Known as the brain tumour segmentation challenge (BraTS), it is attracting increasing attention as well as support from both the Radiological Society of North America and the American Society of Neuroradiology, providing large-scale multimodal MRI data, comprising fluid-attenuated inversion recovery (FLAIR), T1, T2, and contrast-enhanced T1 (T1CE) sequences, as well as the labelled ground-truths of oedema, non-enhancing, and enhancing tumour. 8,19,20 But while benchmark tasks have unquestionably aided the advancement of lesion segmentation, and indeed of computer vision generally, they have compelled a research focus on developing uniformly multimodal models trained on sequence-complete acquisition sets, which are often rare in real-world clinical practice. The causes of incomplete data are legion, but common examples include patient contraindications to contrast, corruption by image artefacts, and image acquisition constraints such as those imposed in pre-operative stealth studies. Taking just one of many possible causes for image degradation, the prevalence of motion artefact has been reported as 7.5% of outpatient and 29.4% of inpatient MRI studies, with an estimated economic impact of $115 000 per scanner, per year. 21 The real-world utility of tumour segmentation must lie within the clinical domain, such as for treatment planning and monitoring across neuro-oncology. Yet, the ability to undertake segmentation in these real-world clinical situations, where complete ('perfect') data is scarce, remains completely unknown.
How well do contemporary segmentation modelling architectures perform when trained on sequence-incomplete data, and what features of the lesion are correctly identifiable under such circumstances?
Here, we aimed to systematically quantify and answer these questions with the largest and most comprehensive study of its kind based on the application of state-of-the-art deep learning tumour segmentation models to large-scale MRI of brain tumours. We hypothesized that the decrement in segmentation performance with the loss of sequences would be modest, rendering good quality segmentations feasible with incomplete data.

Data
The study was approved by the local ethics committee. We received ethical permission for the consentless analysis of irrevocably anonymized data collected during routine clinical care.
We used the BraTS 2021 challenge data for all model training. This dataset is described in detail by its curators elsewhere. 20,22,23 In brief, it includes a large retrospective sample of multi-institutional brain tumour MRI scans, with heterogeneous equipment, protocols, and image quality. The following sequences are included: T1-weighted, T2-weighted, FLAIR, and T1CE, with a pre-processing pipeline consisting of image co-registration, sampling to a 1 mm³ isotropic space, and brain extraction. Lesions were segmented with an ensemble of previous top-ranking BraTS algorithms with subsequent manual refinement and checking by a panel of board-certified attending neuroradiologists with more than 15 years of clinical experience in neuro-oncology. 24 We used the training set of 1251 individuals of the BraTS 2021 challenge data (comprising 5004 separate images), as this group included all ground-truth labels for model cross-validation.
Having trained and evaluated a set of models on the BraTS 2021 challenge data, we sought to separately evaluate their performance on an additional held-out population from our own centre. The aim of this was to provide an additional robust safeguard of model performance with international and external validation. Specifically, we acquired retrospective imaging for a random sample of 50 individuals who underwent gadolinium-enhanced MRI head studies between 2006 and 2021 for a known glioblastoma as part of their routine clinical care at our centre. The random allocation of year selected was to further instil heterogeneity to our sample, as data would be acquired over one of 11 possible MRI scanners of both 1.5 (n = 5) and 3 T (n = 6) field strengths, from multiple different manufacturers, and over a 15-year period. Moreover, of our 50 participants, we also chose to include 10 of those with post-operative imaging and evidential tumour recurrence. This choice increased the difficulty of the task, for a model would need to recognize post-operative resection/surgical bed as separate from the subsequent disease recurrence, as well as capturing the instrumental heterogeneity of different MRI machines distributed in time and place.
Most of our sample did not include volumetric imaging, a reflection of local clinical practice at the time of acquisition. To improve harmonization, we therefore employed super-resolution in the processing pipeline. 25,26 The pipeline yielded data in a similar format to the BraTS challenge data, 20 with 1 mm³ isotropic and brain-extracted multisequence data. Lesions were hand-labelled with ITK-SNAP by a neuroradiology fellow with 3 years of experience working with brain tumour imaging, with the additional aid of the ITK-SNAP semi-automated segmentation tools, namely random forest based classifiers with subsequent manual refinement. 27 Tumour annotations conform to established tissue class labels comprising gadolinium-enhancing tumour, peritumoural oedema/invaded tissue, and non-enhancing tumour/necrotic tumour core. 19 The detailed description of these components is beyond the scope of this article and is discussed elsewhere. 19,20 In brief, enhancing tumour refers to regions with visible enhancement on a T1CE sequence after gadolinium administration. Non-enhancing tumour/necrotic tumour core refers to the part of the tumour that does not enhance after gadolinium, typically deep to the enhancement, while oedema/invaded tissue refers to the peritumoural oedematous and/or infiltrated brain parenchyma, typified by hyperintensity on T2 and FLAIR sequences.

Algorithm
Our task was not to propose a new architecture superior to those already evidenced by the BraTS 2021 challenge. Rather, we sought to characterize, evaluate, and quantify the variation in model performance with increasingly incomplete data, as a proxy index of translational potential across the variety of clinical situations where full complete datasets rarely occur. We chose the nnU-Net ('no new net') self-configuring deep learning biomedical image segmentation modelling architecture, 28 which notably won both the medical segmentation decathlon and the 2020 BraTS challenge. 29,30 In brief, this segmentation method is able to automatically configure itself, including in pre-processing, architecture, training and postprocessing across any task, and has been shown to be a superior methodology across a range of public datasets and tasks, including brain tumour segmentation. 28 Our choice was guided by its excellent performance and the simple, largely automated processing and training cycle, which made development across many models at scale feasible.
Each nnU-Net 28 is a self-configuring U-Net, 31 incorporating the standard encoder-decoder architecture with skip connections, instance normalization, and leaky rectified linear units. We used the high-resolution 3D architectural formulation in all experiments. The nnU-Net approach employs stochastic gradient descent optimization with a polynomially decaying learning rate, initially set to 0.01. The loss function is a weighted sum of the Sørensen-Dice coefficient and cross-entropy. Training data is augmented on the fly, including with rotations, scaling, Gaussian noise and blur, brightness and contrast shifting, and gamma correction. Patch and batch size are also self-configured. Model training utilizes 1000 epochs, with foreground oversampling to mitigate the impact of class imbalances. We used 5-fold cross-validation for each experiment and its evaluation with the BraTS 2021 challenge data, as well as additional external/international out-of-sample evaluation of models with the data from our own centre as detailed above. A schematic of the model architecture is shown in Supplementary Fig. 1.
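The polynomially decaying learning rate described above can be sketched as follows. This is a minimal illustration rather than the nnU-Net source code: the function name `poly_lr` and the decay exponent of 0.9 are assumptions not stated in the text, whereas the initial rate of 0.01 and the 1000 epochs are taken from it.

```python
def poly_lr(epoch: int, max_epochs: int = 1000,
            initial_lr: float = 0.01, exponent: float = 0.9) -> float:
    """Return a polynomially decayed learning rate for a 0-based epoch.

    The exponent of 0.9 is an assumed value; the text only states that
    the decay is polynomial and starts from 0.01.
    """
    if not 0 <= epoch <= max_epochs:
        raise ValueError("epoch must lie between 0 and max_epochs")
    return initial_lr * (1 - epoch / max_epochs) ** exponent


# Learning rate at a few checkpoints across the 1000-epoch schedule.
for epoch in (0, 250, 500, 750, 1000):
    print(epoch, round(poly_lr(epoch), 6))
```

Under these assumptions the rate falls smoothly from 0.01 at the first epoch to zero at the last, without the step changes of a piecewise schedule.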

Statistical analysis and performance evaluation
We trained all possible combinations of the MRI sequences T1, T2, FLAIR, and T1CE as separate models. This included all models using only a single sequence, two sequences, three sequences, and finally a complete four-sequence model. We also trained separate models for abnormality detection (i.e. a binary lesion mask to detect and segment the whole tumour) as well as tumour segmentation with the tissue classes of oedema, enhancing and non-enhancing tumour. This approach comprised 30 different models in total.
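The count of 30 models follows from enumerating every non-empty subset of the four sequences (15 subsets) for each of the two tasks. A minimal sketch of that enumeration is given below; the task labels and variable names are hypothetical and are not taken from the study code.

```python
from itertools import combinations

SEQUENCES = ("T1", "T2", "FLAIR", "T1CE")
# Hypothetical labels for the two modelling tasks described in the text.
TASKS = ("whole_tumour", "tissue_classes")


def sequence_subsets(sequences=SEQUENCES):
    """List every non-empty combination of the given sequences."""
    return [combo
            for size in range(1, len(sequences) + 1)
            for combo in combinations(sequences, size)]


subsets = sequence_subsets()
experiments = [(task, combo) for task in TASKS for combo in subsets]

print(len(subsets))      # 15 sequence combinations (4 + 6 + 4 + 1)
print(len(experiments))  # 30 models across the two tasks
```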
Performance was principally quantified by the out-of-sample Sørensen-Dice coefficient between ground truth and inferred labels, 32,33 in accordance with typical research practices. 3,8 This metric quantifies the overlap between the model prediction and the labelled ground-truth. The Sørensen-Dice coefficient, or Dice coefficient, is given as:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP is true positive, FP is false positive, and FN is false negative.
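As an illustration of the metric, the Dice coefficient equals 2TP / (2TP + FP + FN) and can be computed from two binary masks as sketched below. The function name and the flattened-list representation of the masks are assumptions made for brevity, as is the convention of returning 1.0 when both masks are empty.

```python
def dice_coefficient(truth, prediction):
    """Dice overlap between two equally sized binary masks (flat sequences)."""
    truth = [bool(v) for v in truth]
    prediction = [bool(v) for v in prediction]
    if len(truth) != len(prediction):
        raise ValueError("masks must contain the same number of voxels")

    tp = sum(t and p for t, p in zip(truth, prediction))        # true positives
    fp = sum(p and not t for t, p in zip(truth, prediction))    # false positives
    fn = sum(t and not p for t, p in zip(truth, prediction))    # false negatives

    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * tp / denominator


# Toy example: TP = 2, FP = 1, FN = 1, so Dice = 4 / 6.
truth_mask = [1, 1, 1, 0, 0, 0]
predicted_mask = [1, 1, 0, 1, 0, 0]
print(dice_coefficient(truth_mask, predicted_mask))
```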
We also quantified overall model accuracy, false discovery rate, false negative rate, false omission rate, false positive rate, negative predictive value, precision, and recall, ensuring a broad range of possible performance metrics. 34 All listed metrics were derived for whole tumour and the separate tissue constituents of oedema, enhancing and non-enhancing tumour, including with 95% confidence intervals (CIs), which are provided in detail throughout the Supplementary material (Supplementary Tables 1-4). We constructed regression models between ground truth tumour volumes and model predictions, reporting the R². We obtained the acquisition times of contemporaneous imaging protocols at our centre for each imaging sequence, to allow any gain in model performance to be compared with the time the additional sequence would take to acquire. Lastly, we applied t-distributed stochastic neighbour embedding (t-SNE), 35 a nonlinear dimensionality reduction technique, to the contrast-enhancing components of all lesions in the BraTS dataset to create a 2D representation of the lesions, projecting their high-dimensional similarities and differences into a readily surveyable space. We overlaid lesion volume and the Sørensen-Dice coefficient of lesion segmentations to display any variation in these indices with the morphology of the lesion.
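The secondary metrics listed above all derive from the same four voxel counts. The sketch below shows their standard definitions; the function name is hypothetical, the counts in the example are invented for illustration and are not study results, and the confidence interval estimation is not reproduced.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard performance metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_discovery_rate": ratio(fp, tp + fp),
        "false_negative_rate": ratio(fn, tp + fn),
        "false_omission_rate": ratio(fn, fn + tn),
        "false_positive_rate": ratio(fp, fp + tn),
        "negative_predictive_value": ratio(tn, tn + fn),
    }


# Invented counts, for illustration only.
example = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Several of these are complements of one another (for example, precision and false discovery rate sum to one), which offers a quick consistency check on any reported table.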

Data and code availability
All trained model weights, source code, and usage documentation are publicly available at https://github.com/high-dimensional/tumour-seg. All BraTS 2021 challenge data are readily available from the challenge website at http://braintumorsegmentation.org. 8,20 Original nnU-Net source code is available at https://github.com/MIC-DKFZ/nnUNet. 28 Patient imaging data from our external validation site is not available for dissemination under the ethical framework that governs its use.

Compute
All models were trained on an NVIDIA DGX-1 with eight 16 GB Tesla P100 GPUs. With approximately 3.5 days to train a single model, the task required just over 13 days' utilization of all cards.
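The stated total is consistent with a simple back-of-envelope calculation, sketched below. It assumes, without confirmation from the text, that each of the 30 models occupied a single GPU and that eight models trained concurrently with no idle time.

```python
N_MODELS = 30          # sequence combinations x tasks, as stated in the methods
DAYS_PER_MODEL = 3.5   # approximate training time for one model
N_GPUS = 8             # cards in the DGX-1

# Assumption: one model per GPU, eight models training in parallel.
gpu_days = N_MODELS * DAYS_PER_MODEL
wall_clock_days = gpu_days / N_GPUS

print(gpu_days)         # 105.0 GPU-days in total
print(wall_clock_days)  # 13.125 days with all cards in use
```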

Incremental performance with sequence addition
All models performed well on whole tumour segmentation qualitatively, despite varying degrees of sequence completeness, with quantitative performance ranging from a Dice coefficient of 0.907 (95% CI 0.904-0.910) (single sequence) to 0.945 (95% CI 0.943-0.947) (complete sequence set) (Fig. 1). Results for segmentation of the oedema, enhancing, and non-enhancing components were more variable, with Dice coefficients ranging from 0.701 (95% CI 0.689-0.713) [single sequence (FLAIR) segmenting non-enhancing tumour] to 0.891 (95% CI 0.886-0.896) [complete sequence set (T1 + T2 + FLAIR + T1CE) segmenting oedema]. Of note, the poorest-performing models typically struggled with the non-enhancing tumour component; this particularly affected the single-sequence models of T1, T2, and FLAIR, and the two- and three-sequence models combining them (i.e. those omitting contrast). There was no evidence of model over-fitting when reviewing the training/validation curves. We provide the full breakdown of Dice coefficients for all models in Fig. 1. Example image segmentations across the range of all models are provided in Figs. 2 and 3, which visually illustrate excellent coverage of the lesion by the models, with relatively little error. We additionally detail model accuracy, false discovery rate, false negative rate, false omission rate, false positive rate, negative predictive value, precision, and recall (all with 95% CIs), for whole tumour and the separate tissue constituents of oedema, enhancing and non-enhancing tumour, all of which is provided within the Supplementary material (Supplementary Table 1, Supplementary Table 2, Supplementary Table 3, Supplementary Table 4).

Trade-off between acquisition time and segmentation fidelity
We aligned the acquisition times of all possible combinations of sequences using contemporaneous scanner protocol data at our centre, from which we determined the gain in model fidelity in Dice per scanning minute (Fig. 1). This demonstrated that certain combinations of sequences appeared to offer greater gains in segmentation performance than others, offering an insight into the efficiency of data acquisition in this clinical context. For instance, whilst a single volumetric T1CE acquisition (a proxy for a contrast-enhanced MRI stealth study for neurosurgical planning) took 3.1 minutes, achieving a whole tumour Dice coefficient of 0.908 and reasonable performance on individual components (Fig. 1), the addition of FLAIR raised total scanning time to only 4.9 minutes while improving whole tumour Dice to 0.943, just below the best performing model with all four sequences (Dice coefficient 0.945). Similarly, the three-sequence acquisition of FLAIR + T1CE + T2 (i.e. neglecting the pre-contrast T1) achieved Dice coefficients for whole tumour segmentation essentially equivalent to that of the complete four sequences, and reduced scanning time by 33%, from 9.48 to 6.38 minutes. We do, of course, note that the omission of a pre-contrast T1 brings its own issues in distinguishing contrast enhancement from, for example, haemorrhage, but this is nonetheless a striking illustration of how models with incomplete data still achieved comparable performance.
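The figures quoted above can be turned into a marginal gain per scanning minute and a relative time saving as sketched below. The acquisition times and Dice values are those reported in the text; the function names are hypothetical, and the exact calculation behind Fig. 1 may differ from this sketch.

```python
def marginal_dice_per_minute(base, extended):
    """Gain in Dice per extra scanning minute; each argument is (minutes, dice)."""
    base_minutes, base_dice = base
    extended_minutes, extended_dice = extended
    return (extended_dice - base_dice) / (extended_minutes - base_minutes)


def relative_time_saving(full_minutes, reduced_minutes):
    """Fraction of scanning time saved by the reduced protocol."""
    return (full_minutes - reduced_minutes) / full_minutes


# Whole tumour values reported in the text.
t1ce_only = (3.1, 0.908)
t1ce_plus_flair = (4.9, 0.943)

gain = marginal_dice_per_minute(t1ce_only, t1ce_plus_flair)
saving = relative_time_saving(full_minutes=9.48, reduced_minutes=6.38)

print(f"Dice gained per extra minute when adding FLAIR: {gain:.4f}")
print(f"Time saved by omitting pre-contrast T1: {saving:.1%}")
```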

Segmenting enhancing tumour without contrast-enhanced imaging
Interestingly, we discovered that models without contrast-enhanced imaging could still delineate tumours relatively well (Figs. 4 and 5). Models without contrast imaging segmented whole tumour lesions with Dice coefficients ranging from 0.907 (95% CI 0.904-0.910) (single sequence: T1) to 0.942 (95% CI 0.940-0.945) (three sequences: FLAIR + T1 + T2). Of note, this latter performance was only just shy of the best performing full four-sequence model, with a Dice of 0.945 (95% CI 0.943-0.947). Furthermore, models without the contrast-enhanced T1 sequence could still identify the enhancing tumour component well, with Dice coefficients ranging from 0.756 (95% CI 0.748-0.765) (single sequence: T1) to 0.790 (95% CI 0.782-0.798) (three sequences: FLAIR + T1 + T2) (Figs. 4 and 5). This included the model's ability to identify and segment lesions where the focus of enhancing tumour was less than 7 mm in diameter (Fig. 5). The volume of enhancing tumour was highly significantly correlated with that of all model predictions, despite contrast-enhanced imaging not being provided. The R² between actual enhancing tumour volume and model-predicted volume for each set of inputs was as follows: FLAIR alone (R² = 0.964); T1 alone (R² = 0.953); T2 alone (R² = 0.966); FLAIR + T1 (R² = 0.973); FLAIR + T2 (R² = 0.976); T1 + T2 (R² = 0.962); FLAIR + T1 + T2 (R² = 0.972). Furthermore, inspection of the t-SNE-derived low dimensional representation of the lesions did not reveal any clear relation between lesion anatomy and segmentation performance across models lacking contrast-enhanced sequences (Fig. 6), other than as expected with lesion size, 36 suggesting broad invariance to spatially defined anatomical features.
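An R² of this kind can be obtained from an ordinary least-squares fit of predicted on actual volume, as sketched below. The function name is hypothetical and the volumes in the example are invented for illustration; they are not study data, and the regression specification used in the study may differ.

```python
def r_squared(actual, predicted):
    """Coefficient of determination for a simple linear fit of predicted on actual."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally sized samples of at least two points")

    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    ss_tot = sum((y - mean_y) ** 2 for y in predicted)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("both samples must vary")

    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(actual, predicted)) / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(actual, predicted))
    return 1 - ss_res / ss_tot


# Invented enhancing tumour volumes in cubic centimetres, for illustration only.
actual_volumes = [2.1, 8.4, 15.0, 23.7, 31.2]
predicted_volumes = [2.6, 7.9, 14.1, 24.5, 29.8]
print(round(r_squared(actual_volumes, predicted_volumes), 3))
```

For a single predictor this value equals the square of the Pearson correlation coefficient between the two sets of volumes.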

International clinical validation

Whole tumour segmentation
We evaluated the performance of all trained models on an out-of-sample cohort of 50 patients from our own centre in which lesions were hand-labelled, with scans acquired on both 1.5 and 3 T scanners, and with a mixture of pre- and post-operative imaging. The cross-validation performances of all models from the BraTS data were closely reproduced on our own data, with Dice coefficients for all models significantly correlated (r = 0.97, P < 0.0001) (Fig. 7). This was despite the multiple steps taken to deliberately make the data more heterogeneous and liable to error. As expected, models with single imaging modalities, such as T1 or T2 sequences alone, performed worst, with incremental gains in performance with alternative and supplementary modalities.

Tissue class segmentation
We manually reviewed the tissue segmentations of our own data predicted by the complete four-sequence model and determined that the model's classification of tumour into the subclasses of non-enhancing tumour, enhancing tumour, and oedema was qualitatively more accurate than our semi-automated hand-segmentation. Akin to the method employed by the BraTS 2021 challenge, 20 we therefore adopted these model predictions from complete imaging sets as our new ground-truth, with subsequent manual checking and refinement where required. We then compared the performance of all other models, i.e. those without four sequences, to this revised ground-truth. Model performances were again highly reproducible between the BraTS 2021 challenge data and that of our own external sample, with Dice coefficients significantly correlated (r = 0.95, P < 0.0001) (Fig. 7). As is usually the case in brain tumour segmentation models, segmentations for the non-enhancing tumour component fared worst, especially those with single imaging modalities, whilst prediction of enhancing tumour or oedema fared much better.
We applied our segmentation pipeline to a single patient from our own centre with variable quality (and availability) of imaging during their routine clinical care between 2010 and 2015. We also used this to quantitatively demonstrate lesion volumetry across this time, showing treatment response in early years, followed by stability, and later disease progression (Fig. 8).

Discussion
We have systematically surveyed the ability of state-of-the-art tumour segmentation models to delineate and quantify brain tumour components in real-world clinical situations of incomplete and/or low-quality data. We reveal there is surprisingly little variation in the performance of segmenting a whole tumour with the number of modelled imaging modalities. Greater variation is observed when segmenting tumour components: a clear pattern of incremental improvement with the addition of further sequences emerges. These findings open the door both to the application of segmentation models to large-scale historical data, for the purpose of building treatment and outcome predictive models, and their deployment to real-world clinical care.
Strikingly, we find that segmentation models trained without contrast-enhancing imaging still characterize the anatomy of enhancing tumour components remarkably well. This includes quantification of the volumetric burden of enhancing tumour with high accuracy. Out-of-sample validation illustrates strong generalizability of these findings, including across super-resolved non-isotropic acquisitions, in varying MRI field strengths, and in tumour recurrences on complex postoperative imaging of limited quality. Our analyses show that current segmentation models generalize surprisingly well to real-world clinical imaging varying in quality and sequence completeness. We also use a case-based example (Fig. 8) to demonstrate how this might factor into the clinical workflow, which in this case was achieved using a Docker container with Python, the software requirements as detailed in the methods, and the trained tumour segmentation model weights (all of which is openly available at https://github.com/high-dimensional/tumour-seg).

Additive value of multiple sequences
Model fidelity unsurprisingly rose with the number of modelled sequences. What is, however, surprising is the ability of models based on limited data to delineate lesions very well. This is particularly striking in the segmentation of the whole tumour, where only marginal differences in Dice coefficient were seen across the range of sequence combinations. We can conclude that even single sequences may be sufficient for segmenting brain tumours with fidelity adequate for many downstream tasks.
The segmentation of tumour compartments (oedema, enhancing, and non-enhancing tumour), however, presents a more complex picture. Single sequence models of oedema and enhancing tumour perform best with FLAIR and T1CE sequences, respectively. But models of two or three sequences exhibit less intuitive behaviour. Adding FLAIR to T1CE achieves whole tumour performance very close to that of the complete, four-sequence model, despite receiving only half the data. To that end, single T1CE MRI studies (such as in stealth imaging) may benefit from the addition of a FLAIR sequence to enable better visualization of the entire lesion to aid pre-operative planning. A two-sequence model of T1 and T1CE can delineate oedema well without the T2 or FLAIR typically used to identify it. Overall, these findings illustrate the ability of contemporary computer vision models to extract information from multiple sequences with greater efficiency than intuitive perception may suggest. 37,38

Segmenting enhancing tumour without contrast-enhanced imaging
Strikingly, we found that models without the contrast-enhancing sequence (T1CE) can still segment well what has been hand-labelled by experienced neuroradiologists with full imaging datasets as the enhancing component of the tumour, not least with performance largely invariant to the size, shape, and neuroanatomical location of the enhancing component. This introduces the possibility, across both research and clinical practice, to make approximate inferences about the anatomy of enhancing components without the use of contrast. Moreover, that a model can identify what has been termed the 'enhancing' tumour, 19,38 without any information about its enhancing properties, reveals the presence of non-intuitive imaging features that could render the enhancing component quantifiable without the use of contrast.
This challenges the current dogma of 'enhancing tumour', given a machine can identify it without the administration of intravenous agents ordinarily required to reveal it. Further investigation of this possibility is warranted, including the detectability of the presence of any degree of enhancement. These findings also illustrate a clinically important opportunity in oncological imaging when contrast enhanced imaging cannot be acquired, not least in situations of repeated follow-up where the over-use of contrast should ideally be limited, for example to minimize Gadolinium retention in paediatric patients. We note recent research on completing image sets synthetically may be fruitful in this domain, [39][40][41][42] as well as a wider body of literature aiming to reduce the requirement for contrast. 43

Limitations
In our systematic evaluation of the ability of deep learning models to identify brain tumours with varying degrees of sequence-incomplete data, we opted to use one single self-configuring architecture: nnU-Net. 28,29 The use of this software is well justified given its validated performance across many domains of medical imaging. 30 But segmentation modelling is a rapidly evolving field, and so it is possible that other architectures might perform differently, perhaps even better, than that used here. It is however important to note that our aim was not to identify the definitive 'best' tumour segmentation model but to quantify the impact of sequence completeness. Our aim is to determine how such models could perform in real-world clinical situations where 'perfect' data rarely exists, quantifying their appropriateness for translation to the clinical frontline. Furthermore, BraTS training data includes only preoperative imaging, yet it is plausible that much of the value in segmentation models may lie in longitudinal follow-up including that of postoperative resection appearances. Whilst we included a selection of postoperative imaging in our additional external validation, a more dedicated evaluation in the postoperative setting should form an area for future investigation.

Conclusion
Automated segmentation models can characterize tumours in real-world clinical situations of incomplete imaging data remarkably well. Such models are also able to identify enhancing tumour without the use of contrast-enhanced imaging, potentially providing clinical guidance in circumstances where contrast administration is contraindicated or where its repeated use should be minimized. This opens the way to quantifying enhancing components without the administration of intravenous agents, and invites a revision of the notion of tumour enhancement if the same information can be extracted without contrast. Its applicability includes not just prospective scenarios in which a full scan may not be possible, such as patients unable to receive intravenous contrast, but also historical datasets where certain sequences might not have been acquired. Out-of-sample validation illustrates strong generalizability, across non-isotropic acquisitions and even on complex postoperative imaging where tumours have recurred. Translation of such models to the clinical frontline for response assessment, where complete data is a rarity, may be easier than hitherto believed.

Supplementary material
Supplementary material is available at Brain Communications online.

[...] and T1CE with predicted segmentation overlaid (third row) for each available scanning session. T1 and T2 images are not shown as they were not available for all imaging sessions. Per the colour key, red depicts enhancing tumour, blue is non-enhancing tumour, and green is oedema. (A) Mid-2010 imaging shows the FLAIR (originally a coronal acquisition, but here reconstructed into axial with our super-resolution pipeline for visualization) does not fully cover the posterior margin of the lesion (white arrow). (B) Early 2013 imaging shows the T1CE in some planes of acquisition did not fully cover the cortical surface (note the perfectly vertical line on either side of the brain cortical surface) and thus is super-resolved by using other sequences to recover this (white arrow). (E) Early 2014 FLAIR image demonstrates suboptimal image quality, and yet the segmentation model still delineates the tissue components subjectively well. (H-I) T1CE images undertaken during late 2014 and 2015, respectively, show a radical difference in image contrast, yet the segmentation model still performs subjectively well. Moreover, in (I), the model still recognizes the surgical cavity not to be lesion, despite never being trained with post-operative imaging.